Objective: Gene-by-environment interaction
(GxE) studies in psychiatry have
typically been conducted using a candidate
GxE (cGxE) approach, analogous to
the candidate gene association approach
used to test genetic main effects. Such
cGxE research has received widespread
attention and acclaim, yet cGxE findings
remain controversial. The authors
examined whether the many positive
cGxE findings reported in the psychiatric
literature were robust or if, in aggregate,
cGxE findings were consistent with the
existence of publication bias, low statistical
power, and a high false discovery rate.


Method: The authors conducted analy- 
ses on data extracted from all published 
studies (103 studies) from the first decade 
(2000-2009) of cGxE research in psychiatry. 


Results: Ninety-six percent of novel cGxE
studies were significant compared with
27% of replication attempts. These findings
are consistent with the existence
of publication bias among novel cGxE
studies, making cGxE hypotheses appear
more robust than they actually are.
There also appears to be publication bias
among replication attempts because positive
replication attempts had smaller average
sample sizes than negative ones.
Power calculations using observed sample
sizes suggest that cGxE studies are
underpowered. Low power along with
the likely low prior probability of a given
cGxE hypothesis being true suggests that
most or even all positive cGxE findings
represent type I errors.


Conclusions: In this new era of big data 
and small effects, a recalibration of views 
about groundbreaking findings is neces- 
sary. Well-powered direct replications 
deserve more attention than novel cGxE
findings and indirect replications. 


(Am J Psychiatry 2011; 168:1041-1049) 





Gene-by-environment interactions (GxEs) occur
when the effect of the environment depends on a person's 
genotype or, equivalently, when the effect of a person’s 
genotype depends on the environment. GxE research has 
been a hot topic in fields related to human genetics in re- 
cent years, perhaps particularly so in psychiatry. The first 
decade (2000-2009) of GxE research on candidate genes in 
psychiatry saw the publication of over 100 findings, many 
of them in top journals such as Science and the Journal of 
the American Medical Association. Such a large number of 
GxE studies in high-impact publications raised the promi- 
nence of GxE research in psychiatry and increased its ap- 
peal to scientists eager to build on past successes. 

The excitement about GxE research also stems from its 
explanatory potential and the expectation that GxEs are 
common in nature. Genotypes do not exist in a vacuum; 
their expression must depend to some degree on envi- 
ronmental context. For example, genetic variants influ- 
encing tobacco dependence can have this effect only in 
environments where exposure to tobacco can occur. Simi- 
larly, GxEs could provide compelling explanations for why 
one person becomes depressed in response to severe life 
stressors while another does not (1), or why cannabis use
increases risk for psychosis in one person but not in another
(2). Indeed, it would be astonishing if GxEs did not exist,
for this would mean that reactions to the environment are 
among the only nonheritable phenotypes (3). Consistent 
with this expectation, twin analyses convincingly demon- 
strate that at least some responses to the environment are 
heritable (4). Given these general reasons to expect that 
GxEs are common, most of the focus in psychiatric stud- 
ies over the past decade has been on determining the spe- 
cific genetic variants and environmental risk factors that 
underlie GxEs. In this article, we focus on such measured 
GxE studies as opposed to “latent variable” GxE studies, 
in which omnibus genetic risk is estimated using twins or 
other relatives. 

The enthusiasm for GxE research has recently been tem- 
pered by increasing skepticism (5-7). Critics worry about 
the multiple testing problem combined with publication 
bias against null results (6). The large number of potential 
GxE hypotheses—because of the many variables, opera- 
tional definitions, and analyses that can be conducted— 
creates a large number of testable hypotheses, and there 
is a risk that only the “most interesting” (i.e., significant) 
findings will be published. To the degree that this occurs, 
the GxE literature contains an inflated number of false
positives. Additionally, power to detect interactions is typi- 
cally lower than power to detect main effects (8), so the 
difficulties in detecting genetic main effects to date (9, 10) 
may portend even more difficulties in detecting true inter- 
actions. Furthermore, interactions are sensitive to the scale 
on which the variables are measured (11). Altering the scale 
(e.g., taking the logarithm of the dependent variable) can 
cause interactions to disappear, even so-called crossover 
interactions that are supposedly insensitive to scale (7). 

Perhaps most centrally, almost every GxE study con- 
ducted to date has used a candidate gene-by-environment 
interaction (cGxE) approach, whereby both genetic and
environmental variables were hypothesized a priori. This 
is not an easy task given the inchoate understanding of 
the genotype-to-phenotype pathways in psychiatric dis- 
orders. Indeed, genome-wide association studies (GWAS) 
have largely failed to replicate reported associations from 
the candidate gene literature (12-14; however see Lasky- 
Su et al. [15]). Thus, there is reason to question whether 
the candidate gene approach will be more successful in 
detecting replicable interactions than it has been in de- 
tecting replicable main effects. 

Given such strongly polarized sentiments about cGxE 
research—excitement about the promise of cGxE research 
on the one hand and concern about the high rate of false 
positives on the other—we decided to survey the pattern 
of cGxE results in psychiatry in order to gauge whether 
there was evidence supporting the critics’ concerns or 
whether the pattern of reported cGxE results was indica- 
tive of robust and promising findings. A formal meta- 
analysis across the entire cGxE field in psychiatry is not 
possible given the wide variety of interactions that have 
been examined. Nevertheless, by examining the patterns 
of cGxE findings, collapsed across the varied hypotheses 
investigated to date, we have attempted to gain some le- 
verage on the state of cGxE findings overall. 


Included Studies 


We attempted to identify all cGxE studies published in 
the first decade (2000-2009) of cGxE research in psychia- 
try. We conducted searches using MEDLINE, PubMed, and 
Google Scholar, and we searched the reference sections of 
cGxE papers. Phenotypes in cGxE studies had to be DSM-IV 
diagnoses or closely related constructs (e.g., neuroticism). 
Only observational, as opposed to experimental, studies 
were included; pharmacogenetic studies were excluded. 
Studies were included only if there was variation across 
participants for phenotypic, genetic, and environmental 
variables (e.g., exposure-only designs were excluded). 

In total, 98 articles encompassing 103 studies met inclu- 
sion criteria (five of the 98 articles reported results for two 
independent samples). A list of included and excluded 
studies and how they were coded is provided in the data 
supplement that accompanies the online edition of this 
article. Analyses were limited to interactions discussed in
the abstracts of articles because results not mentioned in 
the abstract were often described in insufficient detail for 
accurate categorization. Each of the 103 studies was clas- 
sified either as novel (containing no previously reported 
interactions) or as a replication attempt of a previously 
reported interaction. Replication attempts were defined 
as reports of an earlier cGxE finding in a separate article 
in which 1) the phenotypic variable was identified with 
the same name as the variable in the original report, even 
if specific scales differed (e.g., depression could be mea- 
sured via self-report or clinician diagnosis); 2) the genetic 
polymorphism and genetic model (e.g., additive) were 
the same as in the original study; 3) the environmental 
moderator was substantively the same; and 4) replication 
results were reported for the same gender as the original 
report. Because of the inherent subjectivity involved in 
determining whether environmental moderators such 
as “stressful life events,” “maltreatment,” and “hurricane 
exposure” should be considered equivalent, we deferred 
to the primary authors regarding whether specific envi- 
ronmental variables measured the same construct. When 
possible, we report whether the original finding was actu- 
ally replicated (p<0.05 in the same direction) for a given 
study. For example, Brummett et al. (16) present signifi- 
cant results of a three-way interaction, but we used the 
clearly nonsignificant results of the two-way interaction 
that tested the original hypothesis (17). When we could 
not clearly discern whether the original study was repli- 
cated, the replication attempt was excluded. Replication 
attempts were excluded for the following reasons: genetic 
model discrepancies (nine studies), gender discrepancies 
(eight studies), insufficient information (two studies), and 
replication attempt within the original report (one study). 


Publication Bias Among Novel Reports 
of cGxE Studies 


Publication bias, the tendency to publish significant 
results more readily than nonsignificant ones, is wide- 
spread in biomedical research (18). While understandable 
given journal editors’ motivation to publish findings with 
greater impact (typically novel, significant findings) and 
authors’ decisions not to submit null findings (which re- 
quire more work but have less payoff), publication bias is 
problematic because it produces a distorted representa- 
tion of findings in an area of study (19). 

An indirect way to gauge the degree to which publica- 
tion bias has occurred in novel studies (first reports of 
particular interactions) is to compare the rate of positive 
(significant) results among novel cGxE studies to the rate 
of positive results (that significantly replicated the original 
finding) among replication attempts. Replication attempts 
should more accurately reflect the true rate of positive 
cGxE findings because both positive and null replication 
results will be of interest to readers and be deemed
publishable. Novel reports, on the other hand, may be deemed
publishable only when positive. If so, publication bias will 
manifest as a higher rate of positive results among novel 
cGxE studies than among replication attempts. Consistent 
with this expectation, 96% (45/47) of novel cGxE findings 
were positive, but only 27% (10/37) of replication attempts 
were positive (Fisher’s exact test, p=1.29×10⁻¹¹). This p
value should be interpreted with caution because many 
of the replication attempts were not independent of each 
other (e.g., the 5-HTTLPR-by-stressful life events inter- 
action predicting depression was tested multiple times). 
Consequently we reran the analysis, excluding all but the 
first published replication attempt for each interaction. 
Despite the reduction in number of data points and the 
attendant loss of power, the results remained highly sig- 
nificant: 22% (2/9) of first replication attempts were posi- 
tive, compared with 96% (45/47) of novel studies (Fisher’s 
exact test, p=5.2×10⁻⁶).
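
The two comparisons above can be checked directly from the reported counts. The following is a minimal sketch, assuming a two-sided Fisher's exact test that sums every table no more probable than the observed one; the function name is hypothetical and the article does not say which software or sidedness was used.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    total = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x first-column counts in row 1
        return comb(row1, x) * comb(row2, col1 - x) / total

    lo, hi = max(0, col1 - row2), min(row1, col1)
    p_obs = prob(a)
    # Sum all tables that are no more probable than the observed table
    # (small relative tolerance guards against floating-point ties)
    return sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))

# Rows: novel studies, replication attempts; columns: positive, not positive
p_all = fisher_exact_two_sided(45, 2, 10, 27)   # all 37 replication attempts
p_first = fisher_exact_two_sided(45, 2, 2, 7)   # first replication attempts only
print(f"all replications: {p_all:.2e}; first replications: {p_first:.2e}")
```

Both values should land on the same order of magnitude as the p values quoted in the text.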

These results are consistent with the hypothesis of 
widespread publication bias among novel cGxE reports, 
suggesting that many more tests of novel interactions 
have been conducted than reported in the literature. Giv- 
en that increasing publication bias leads to an increasing 
field-wise type I error rate (because negative results go un- 
published), these findings provide a clear warning against 
premature acceptance of novel cGxE findings. 


Publication Bias Among Replication 
Attempts of cGxE Studies


The analysis above relies on the assumption that rep- 
lication attempts provide a more accurate reflection of 
the true rate of positive cGxE findings than do novel stud- 
ies. While probably true, publication bias may also exist 
among replication attempts themselves, meaning that 
less than 27% of replication attempts are actually positive. 
To test for evidence consistent with this possibility, we 
compared sample sizes of positive (significant and in con- 
sistent direction) replication attempts and negative (non- 
significant or opposite direction) replication attempts. 

In the absence of publication bias, and when the hy- 
potheses being tested are true, positive replication at- 
tempts should tend to have larger sample sizes than 
negative replication attempts because, holding effect size 
constant, larger samples provide greater statistical power 
(20). This pattern of results—larger replication studies be- 
ing more likely to be significant—occurs in fields where 
the relationships being tested have proven robust, such 
as the smoking-cancer link (21). However, in the presence 
of publication bias, the opposite pattern of results could 
be observed—smaller replication studies may be more 
likely to be significant. This would occur if larger replica- 
tion attempts were published irrespective of the direction 
of the results, whereas smaller studies were preferentially 
published when they yielded positive results. Consistent
with the presence of publication bias among replication
attempts (Figure 1), the median sample size of the 10 positive
replication attempts was 154, whereas the median
sample size of the 27 negative replication attempts was
377 (Wilcoxon rank-sum test, T=56, p=0.007). The nonparametric
Wilcoxon rank-sum test was used because
sample sizes were highly skewed, but results here and below
were also significant using parametric tests.

FIGURE 1. Testing for Publication Bias in Replication Attempts
of Candidate Gene-by-Environment (cGxE) Interaction
Research (x-axis: Replication Attempt Status)

This figure shows boxplots of sample sizes for three classifications
of replication studies in cGxE interaction research. Positive
replications significantly replicated (p<0.05) a previous cGxE effect.
Negative replications failed to replicate a previous cGxE effect. Pure
negative replications (a subset of negative replication attempts)
failed to replicate a previous cGxE effect and were not published
alongside other positive cGxE findings. Boxes are first and third
quartiles; black lines represent whiskers (maximum and minimum
non-outlier values). Outliers (values beyond 1.5 box lengths from
the first or third quartile) are shown as points.

We used one additional, independent approach to test 
for evidence consistent with publication bias among rep- 
lication attempts, hypothesizing that negative replication 
attempts may be published more readily when reported 
with some other novel, positive cGxE finding. Consistent 
with this, 63% (17/27) of negative replication attempts 
were reported with novel, positive cGxE findings where- 
as only 20% (2/10) of positive replication attempts were 
published with novel, positive cGxE findings (Fisher’s ex- 
act test, p=0.03). Moreover, it appears that much larger 
sample sizes are needed in order for negative replication 
attempts to be published: the median sample size of the 
10 “pure negative” replication attempts (not published 
alongside another novel, positive cGxE finding) was 1,019, 
which is more than six times larger than the median sam- 
ple size (N=154) of the 10 positive replication attempts 
(T=9, p=0.001; see Figure 1). 

Although publication bias is the obvious explanation 
for these otherwise counterintuitive findings, systematic 
differences between smaller and larger studies may also 
play a role. For example, Caspi et al. (1) and Lotrich and 
Lenze (22) argued that smaller cGxE studies tend to use 
higher-precision prospective measures, whereas larger
studies tend to use lower-precision retrospective reports. 
If so, smaller replication studies may be more likely to be 
positive because they tend to analyze variables with less 
measurement error than larger replication studies and 
not because of publication bias. However, this argument 
does not explain why negative replications are published 
alongside novel cGxE findings more often than positive 
replications or why negative replications published alone 
have the largest sample sizes. Taken together, we believe 
that publication bias among replication attempts is the 
most parsimonious explanation for our results. 


Power to Detect cGxEs


Statistical power is the probability of detecting a signifi- 
cant result given that the alternative (here, cGxE) hypoth- 
esis is true. Statistical power has been a central issue in 
modern psychiatric genetics, and it is likely that most can- 
didate gene studies have been underpowered (23). Several 
studies have likewise investigated the statistical power of 
cGxE studies (24-27) and have concluded that power to 
detect cGxE interactions is even lower, sometimes much 
lower, than power to detect genetic or environmental 
main effects. Low statistical power in a field is problemat- 
ic, not only because it implies that true findings are likely 
to be missed, but also because low power increases the 
proportion of significant “discoveries” in a field that are 
actually false. 

Interactions are tested by multiplying two first-order 
(here, gene and environment) predictors together, creat- 
ing a product term. All three variables (the two first-order 
variables and the product term) are entered into the model,
and a significant product term is evidence for an interaction
effect. It is often argued (e.g., in Caspi et al. [1]) that
the reduction in power to detect interaction effects is due 
to the correlation between the product term and the first- 
order predictors, but this is incorrect; the correlation be- 
tween the product and the first-order terms plays no role 
in the power to detect interactions (8). This can be seen 
by centering (subtracting the mean from) symmetrically 
distributed first-order predictors, which reduces the cor- 
relation between product and first-order terms to ~0 but 
does not change the significance level of the product term. 
(The same effect occurs for nonsymmetrically distributed 
predictors, although the constant subtracted will not be 
the mean; see Smith and Sasaki [28].) 
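
The centering argument can be illustrated with a small simulation. This is only a sketch on hypothetical data: the sample size, means, coefficients, and helper names are arbitrary choices for illustration and come from no study discussed here. It fits the same ordinary least squares model with raw and with mean-centered predictors and compares the product-term correlation and t statistic.

```python
import random
from math import sqrt

def corr(u, v):
    """Pearson correlation of two equal-length lists."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    suv = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    suu = sum((a - mu) ** 2 for a in u)
    svv = sum((b - mv) ** 2 for b in v)
    return suv / sqrt(suu * svv)

def invert(m):
    """Invert a small square matrix by Gauss-Jordan elimination with pivoting."""
    k = len(m)
    aug = [list(row) + [float(i == j) for j in range(k)] for i, row in enumerate(m)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(k):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * w for v, w in zip(aug[r], aug[col])]
    return [row[k:] for row in aug]

def product_term_t(g, e, y):
    """t statistic of the g*e coefficient in the OLS model y ~ 1 + g + e + g*e."""
    rows = [[1.0, gi, ei, gi * ei] for gi, ei in zip(g, e)]
    k, n = 4, len(y)
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    inv = invert(xtx)
    beta = [sum(inv[i][j] * xty[j] for j in range(k)) for i in range(k)]
    rss = sum((yi - sum(b * x for b, x in zip(beta, r))) ** 2 for r, yi in zip(rows, y))
    se = sqrt(rss / (n - k) * inv[3][3])
    return beta[3] / se

# Hypothetical data: symmetric predictors with nonzero means plus an interaction
rng = random.Random(0)
n = 500
g = [rng.gauss(2.0, 1.0) for _ in range(n)]
e = [rng.gauss(3.0, 1.0) for _ in range(n)]
y = [0.3 * gi + 0.5 * ei + 0.2 * gi * ei + rng.gauss(0.0, 1.0) for gi, ei in zip(g, e)]

mg, me = sum(g) / n, sum(e) / n
gc = [gi - mg for gi in g]
ec = [ei - me for ei in e]

corr_raw = corr(g, [a * b for a, b in zip(g, e)])
corr_centered = corr(gc, [a * b for a, b in zip(gc, ec)])
t_raw = product_term_t(g, e, y)
t_centered = product_term_t(gc, ec, y)
print(f"corr(g, g*e): raw={corr_raw:.2f}, centered={corr_centered:.2f}")
print(f"t(product):   raw={t_raw:.4f}, centered={t_centered:.4f}")
```

Centering only reparameterizes the model, so the product-term t statistic is the same in both fits up to rounding, while the correlation between the product and the first-order term drops toward zero.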

The primary reason that power to detect interactions 
tends to be low is that the variance of the product term 
tends to be low in nonexperimental studies (8). Power to 
detect the effect of any predictor, including a product term, 
increases as a function of the variance of that predictor. 
The variance of product (here, cGxE) terms is maximized 
when subjects are selected from the joint extremes (high 
G-high E, low G-high E, high G-low E, and low G-low E) 
of the two first-order predictors, but such jointly extreme 
observations tend to be rare in nonexperimental studies
(8). This issue is particularly relevant to cGxE studies, as 
it is generally not possible to sample from the genotypic 
extremes (e.g., equal numbers of the two homozygotes). 
Thus, power in cGxE studies will be maximized whenever 
variance in the two first-order predictors is maximized, 
that is, when the minor allele frequencies are high (e.g., 
0.50 for biallelic loci) and when equal numbers of subjects 
are exposed to the extremes of the environmental moder- 
ator (25). Additional factors such as ascertainment strate- 
gy (29), study design (30), correlation between the genetic 
and environmental variables (8), and measurement error 
in the variables (23) also affect statistical power to detect 
cGxE effects and should be considered in interpreting re- 
sults from cGxE studies. 
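
The dependence of predictor variance on minor allele frequency can be made concrete. The sketch below assumes an additive 0/1/2 allele-count coding and Hardy-Weinberg equilibrium, neither of which is specified in the text; under those assumptions the genotype variance is 2p(1-p), which peaks at a minor allele frequency of 0.50.

```python
def additive_genotype_variance(maf):
    """Variance of a 0/1/2 allele count under Hardy-Weinberg equilibrium."""
    p, q = maf, 1.0 - maf
    freqs = {0: q * q, 1: 2 * p * q, 2: p * p}  # genotype frequencies
    mean = sum(g * f for g, f in freqs.items())
    return sum((g - mean) ** 2 * f for g, f in freqs.items())

for maf in (0.05, 0.10, 0.25, 0.50):
    print(f"MAF={maf:.2f}  variance={additive_genotype_variance(maf):.3f}")
```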

In Figure 2A, we provide power estimates for cGxEs giv- 
en three different effect sizes and plot them above a histo- 
gram of actual sample sizes from the first decade of cGxE 
studies (Figure 2B). Power estimates were derived from 
10,000 Monte Carlo simulations with alpha set to 0.05. We 
assumed that no error occurred in any of the measures and 
that the environmental and genetic variables accounted 
for 20% and 0.5% of the variance in the outcome variable, 
respectively. These are favorable values for the detection 
of GxE effects because increasing variance accounted for 
by the first-order terms increases power to detect an inter- 
action term in linear regression. 

In Figure 2A, the three lines depict statistical power 
for three different possible cGxE effect sizes. As a point 
of reference, the effect size designations in Figure 2A reflect
what would be considered very large (r²=0.10), large
(r²=0.01), and moderate (r²=0.001) for genetic main effects
in large GWAS, which provide the most reliable informa- 
tion about the true effect sizes of genetic main effects (31). 
We used these effect sizes to provide points of reference, 
although it is possible that GxE effects tend to be larger or 
smaller than genetic main effects. 

Sample sizes from the 103 cGxE studies are depicted 
in Figure 2B. The median sample size, shown as a vertical
line in Figure 2A, was 345. Assuming a moderate effect
size of r²=0.001, statistical power was less than 10%
for the median sample size. Given large and very large ef- 
fect sizes, cGxE studies required sample sizes of ~600 and 
~50 to reach sufficient statistical power (80%) to reject the 
null. In sum, unless cGxE effect sizes are over an order of 
magnitude larger than the typical genetic main effect sizes 
detected in GWAS, then cGxE studies have generally been 
underpowered, perhaps severely so, a conclusion also 
reached by others (23, 32). 
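
A rough analytic check of these sample-size figures is possible without simulation. The sketch below is not the Monte Carlo procedure used for Figure 2A; it swaps in a two-sided normal approximation for the test of a single regression term, with Cohen's f² computed from the interaction r² and an assumed 20.5% of variance explained by the first-order terms (20% environmental plus 0.5% genetic, as stated above). Function names are hypothetical, and the results should only roughly track the simulated curves.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()

def approx_power(n, r2_interaction, r2_main=0.205, alpha=0.05):
    """Approximate power to detect one term explaining r2_interaction of variance."""
    f2 = r2_interaction / (1.0 - r2_main - r2_interaction)  # Cohen's f-squared
    z_crit = _STD_NORMAL.inv_cdf(1.0 - alpha / 2.0)
    z = (n * f2) ** 0.5  # approximate expected test statistic
    return _STD_NORMAL.cdf(z - z_crit) + _STD_NORMAL.cdf(-z - z_crit)

def approx_n_for_power(target, r2_interaction, r2_main=0.205, alpha=0.05):
    """Smallest n (by linear search) whose approximate power reaches target."""
    n = 5
    while approx_power(n, r2_interaction, r2_main, alpha) < target:
        n += 1
    return n

print("power at N=345, r2=0.001:", round(approx_power(345, 0.001), 3))
print("N for 80% power, r2=0.01:", approx_n_for_power(0.80, 0.01))
print("N for 80% power, r2=0.10:", approx_n_for_power(0.80, 0.10))
```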


The False Discovery Rate in cGxE
Research in Psychiatry 


A necessary, albeit underappreciated, consequence of 
low power is that it increases the false discovery rate—the 
proportion of “discoveries” (significant results) in a field
that actually represent type I errors (33, 34). Other factors
influencing the false discovery rate are the chosen type I
error rate (typically α=0.05) and the proportion of tested
hypotheses that are correct (the prior). Given these parameters,
calculation of the false discovery rate is straightforward:

$$\mathrm{FDR} = \frac{\alpha\,(1-\text{prior})}{\alpha\,(1-\text{prior}) + (\text{power} \times \text{prior})}$$

FIGURE 2. Power as a Function of Sample Size for Three Potential cGxE Effect Sizes (Panel A) and Distribution of Observed
Sample Sizes in the cGxE Literature (Panel B)





In addition to suggesting low power, candidate gene 
main effect research in psychiatry suggests that the priors 
in cGxE research may also be low. For one thing, candi- 
date gene main effect studies in psychiatry have yielded 
no unequivocally accepted associations after more than 
a decade of intense efforts (10), despite the fact that can- 
didate gene main effect hypotheses were predicated on 
robust neurobiological findings. In contrast, GWAS have 
identified numerous replicable associations that have not 
usually been in candidate genes: out of 531 of the most 
robustly associated single-nucleotide polymorphisms 
(SNPs) to various medical and psychiatric phenotypes in 
GWAS studies, 45% were in introns, 43% were in intergenic 
regions, and only 11% were in exons (35), the typical hunt- 
ing ground for candidate polymorphisms. Furthermore, 
when candidate polymorphisms have been examined
among GWAS results, they have usually not demonstrated 
better than chance performance (12-15). 

Thus, accumulating evidence suggests that our under- 
standing of the neurobiological underpinnings of psychi- 
atric disorders has, to date, typically been insufficient to 
lead to correct hypotheses regarding candidate polymor- 
phisms. Colhoun et al. (9) estimated that 95% of candidate 
gene main effect findings were actually false positives, 
which translates to a prior of between 0.3% and 3% (as- 
suming statistical power is between 10% and 90%). Be- 
cause of the need also to specify the correct moderating 
environmental variable, generating cGxE hypotheses that 
prove correct may be even more difficult than generating 
(simpler) genetic main effect hypotheses. Thus, the prior 
for cGxE studies may be lower than the 0.3% to 3% it ap- 
pears to be for candidate gene main effect hypotheses. 
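
The step from a 95% false positive rate to a prior of 0.3% to 3% follows from inverting the false discovery rate formula given earlier. A minimal sketch, assuming α=0.05 (the text describes this value only as typical) and a hypothetical function name:

```python
def prior_from_fdr(fdr, power, alpha=0.05):
    """Solve FDR = alpha(1-p) / (alpha(1-p) + power*p) for the prior p."""
    odds = alpha * (1.0 - fdr) / (fdr * power)  # prior odds, p / (1 - p)
    return odds / (1.0 + odds)

for power in (0.10, 0.90):
    print(f"power={power:.0%}  implied prior={prior_from_fdr(0.95, power):.1%}")
```

With power at 90% the implied prior is near the lower bound quoted above, and with power at 10% it is near the upper bound.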

FIGURE 3. The False Discovery Rate as a Function of Statistical
Power and the Prior (Percentage of Hypotheses That Are True)

Figure 3 shows the false discovery rate as a function of
varying assumptions about power and the prior. If cGxE
hypotheses prove to be like candidate gene hypotheses,
with (optimistic) values of the prior and power of 5% and
55%, respectively, then approximately two-thirds (63%)
of positive findings would represent type I errors. Using
values of the prior (1%) and statistical power (10%) that
may be more realistic, the false discovery rate is 98%. Obviously,
the true false discovery rate in the cGxE field may
be higher, lower, or in between these values.
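
The two scenarios just described can be reproduced from the false discovery rate formula. A minimal sketch, assuming α=0.05 throughout; the function name is hypothetical:

```python
def false_discovery_rate(prior, power, alpha=0.05):
    """Share of significant results that are type I errors."""
    false_positives = alpha * (1.0 - prior)  # true nulls that reach significance
    true_positives = power * prior           # true effects that reach significance
    return false_positives / (false_positives + true_positives)

optimistic = false_discovery_rate(prior=0.05, power=0.55)
realistic = false_discovery_rate(prior=0.01, power=0.10)
print(f"optimistic: {optimistic:.0%}; more realistic: {realistic:.0%}")
```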


The 5-HTTLPR-by-Stressful Life Events 
Interaction Example 


In 2003, Caspi and colleagues (17) reported an in- 
creasingly positive relationship between number of self- 
reported stressful life events and depression risk among 
individuals having more short alleles at the serotonin 
transporter (5-HTTLPR) polymorphism. Their study has 
been extremely influential, having tallied over 3,000 ci- 
tations and a large number of replication attempts. We 
reviewed this specific cGxE hypothesis and the attempts 
to replicate it because it highlights the important issue of 
direct compared with indirect replications and because it 
potentially illustrates the issues surrounding publication 
bias and false discovery rates discussed above. 

Both direct cGxE replications, which use the same sta- 
tistical model on the same outcome variable, genetic poly- 
morphism, and environmental moderator tested in the 
original report, and indirect cGxE replications, which rep- 
licate some but not all aspects of an original report, exist 
in the cGxE literature. Indirect replications might some- 
times be conducted to help understand the generalizabil- 
ity of an original report (1) and might in other cases be 
conducted out of necessity because available variables do 
not match those in the original report. However, it is also 
possible that in an unknown number of cases, a positive 
indirect replication was discovered by testing additional 
hypotheses after a direct replication test was negative. 
Sullivan (36) showed that when replications in candidate 
gene association (main effect) studies are defined loosely,
the type I error rate can be very high (up to 96% in his
simulations). The possibilities for loosely defined, indirect
replications are even more extensive in cGxE research 
than in candidate gene main effect research because of 
the additional (environmental) variable. Thus, we believe 
it is important that only direct replications are considered 
when gauging the validity of the original cGxE finding (see 
also Chanock et al. [37]). Once an interaction is supported 
by direct replications, indirect replications can gauge the 
generalizability of the original finding, but until then they
should be considered novel reports, not replications. 

The decision of how indirect a replication attempt can 
be in order to be included in a review or meta-analysis is 
critical for gauging whether a finding has been support- 
ed in the literature. With respect to the interaction of 5- 
HTTLPR and stressful life events on depression, a meta- 
analysis by Munafo et al. (38) and subsequent meta- and 
mega-analysis by Risch et al. (5) examined results and/or 
data from 14 overlapping but not identical replication at- 
tempts and failed to find evidence supporting the original 
interaction reported by Caspi et al. (17). However, a much 
more inclusive meta-analysis by Karg et al. (39) looking at 
56 replication attempts found evidence that strongly sup- 
ports the general hypothesis that 5-HTTLPR moderates 
the relationship between stress and depression. Karg et 
al. argue that these contradictory conclusions were main- 
ly caused by the different sets of studies included in the 
three analyses. Karg et al. included studies that Munafo et 
al. (38), Risch et al. (5), and we, in this report, consider to 
be indirect replications. For example, Karg et al. included 
studies investigating a wide range of alternative environ- 
mental stressors (e.g., hip fractures), alternative outcome 
measures (e.g., physical and mental distress), and alter- 
native statistical models (e.g., dominant genetic models). 
Furthermore, 11 studies included in the Karg et al. anal- 
ysis used “exposure only” designs that investigate only 
those individuals who have been exposed to the stressor. 
We excluded such designs in this review because they do 
not actually test interactions; rather, interactions must be 
inferred by assuming an opposite or no relationship be- 
tween the risk allele and the outcome in nonexposed in- 
dividuals. Additionally, the result from at least one of the 
studies deemed supportive of the interaction in Karg and 
colleagues’ meta-analysis (40) is actually in the opposite 
direction of the original finding when the same statistical 
model employed in the original report is used (5). Taken 
together, the pattern of results emerging from these three 
meta- and mega-analyses is surprisingly consistent: direct 
replication attempts of the original finding have gener- 
ally not been supportive, whereas indirect replication at- 
tempts generally have. 

There also appears to be evidence of publication bias 
among the studies included in the Karg et al. (39) article. 
As we have shown to be the case in the broader cGxE liter- 
ature, larger studies included in the Karg et al. meta-anal- 
ysis were less likely to yield significant results. A logistic 
model regressing replication status (significant replication
compared with not) on sample size among studies included
in their meta-analysis found that the odds of a significant
replication of Caspi’s original finding decreased
by 10% for every additional 100 participants (B=−0.001,
p=0.02).
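
The link between the coefficient and the 10% figure is a one-line calculation. The sketch below assumes that B is the change in log-odds per additional participant, which the text implies but does not state; the function name is hypothetical.

```python
from math import exp

def odds_multiplier(beta_per_participant, extra_participants):
    """Factor by which the odds change after adding extra_participants."""
    return exp(beta_per_participant * extra_participants)

change = odds_multiplier(-0.001, 100) - 1.0
print(f"change in odds per 100 additional participants: {change:.1%}")
```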

Karg et al. (39) touch on the possibility of publication 
bias affecting their results by calculating the fail-safe ra- 
tio. They note that 14 studies would have to have gone 
unpublished for every published study in order for their 
meta-analytic results to be nonsignificant. While this ratio 
is intended to seem unreachably high, a couple of points 
should be kept in mind. First, the fail-safe ratio speaks not 
to unpublished studies but rather to unpublished analy- 
ses. As discussed above, possibilities for alternative analy- 
ses (i.e., indirect cGxE replications) abound: alternative 
outcome, genotypic, and environmental variables can be 
investigated; covariates or additional moderators can be 
added to the model; additive, recessive, and dominant ge- 
netic models can be tested; phenotypic and environmen- 
tal variables can be transformed; and the original finding 
can be tested in subsamples of the data. We observed each 
of these situations at least once among studies “consistent 
with” or “replicating” the original 5-HTTLPR-by-stressful 
life events interaction, and such indirect replications can 
have a high false positive rate. Second, and most impor- 
tantly, Karg et al. used extremely liberal inclusion criteria, 
analyzing many indirect replications that we either clas- 
sified as novel studies or excluded completely. Thus, the 
findings of Karg et al. and the findings we present here 
recapitulate one another; almost all novel studies (our re- 
view) and indirect replications (the Karg et al. meta-analy- 
sis) are positive, whereas most direct replications are not. 
This suggests that positive meta-analytic findings become 
more likely as study heterogeneity increases. Notably, this 
is exactly the opposite of what would be expected if the 
original results were true. Stricter replication attempts 
should be more likely, not less likely, to be significant. 
Rather than interpreting the fail-safe ratio as evidence that 
the 5-HTTLPR-by-stressful life events interaction has rep- 
licated, this ratio might be better interpreted as providing 
a rough estimate of how large the “file drawer problem” is 
in the cGxE field. 
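
To make concrete what a fail-safe ratio measures, the sketch below uses Rosenthal's classic fail-safe N, which is one common formulation; whether Karg et al. used exactly this formula is an assumption here, and the z scores are hypothetical values chosen only for illustration. The calculation counts null results and does not distinguish unpublished studies from unpublished analyses, which is the point made above.

```python
def fail_safe_n(z_scores, z_alpha=1.645):
    """Rosenthal's fail-safe N: the number of additional results averaging
    z = 0 needed to pull the combined (Stouffer) z below `z_alpha`.

    z_alpha = 1.645 corresponds to a one-tailed alpha of 0.05.
    """
    k = len(z_scores)
    z_sum = sum(z_scores)
    return (z_sum ** 2) / (z_alpha ** 2) - k

def fail_safe_ratio(z_scores, z_alpha=1.645):
    """Fail-safe N expressed per included result."""
    return fail_safe_n(z_scores, z_alpha) / len(z_scores)

# Hypothetical z scores, one per included result (not data from any study).
example_z = [2.0, 1.5, 2.5, 0.5, 1.0]
print(f"Fail-safe N:     {fail_safe_n(example_z):.1f}")
print(f"Fail-safe ratio: {fail_safe_ratio(example_z):.2f}")
```

Because each alternative analysis of a data set can supply one of the null results counted here, a ratio that looks unreachable when read as a count of studies may be far easier to reach when read as a count of analyses.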


Conclusion 


Despite numerous positive reports of cGxEs in the psy- 
chiatric genetics literature, our findings underscore sev- 
eral concerns that have been raised about the cGxE field 
in psychiatry. Our results suggest the existence of a strong 
publication bias toward positive findings that makes 
cGxE findings appear more robust than they actually are. 
Almost all novel results are positive, compared with less 
than one-third of replication attempts. More troubling 
is evidence suggesting that replication studies, generally 
considered the sine qua non of scientific progress, are also 
biased toward positive results. Furthermore, it appears 
that sample sizes for null replication results must be approximately 
six times larger than sample sizes for positive 
replication results in order to be deemed publishable on 
their own. Such a publication bias among replication at- 
tempts suggests that meta-analyses, which collapse across 
replication results for a given cGxE hypothesis, will also 
be biased toward being unrealistically positive. Although 
methods exist to detect publication biases (e.g., the fun- 
nel plot), they are not very sensitive, and correcting meta- 
analytic results for this bias is difficult (41). Finally, our 
findings suggest that meta-analyses using very liberal in- 
clusion thresholds (e.g., Karg et al. [39]) are virtually guar- 
anteed to find positive results. 

The statistical power to detect cGxE effects is another 
important consideration. Unless cGxE effects are many 
times larger than typical genetic main effects, most cGxE 
studies conducted to date have been underpowered. This 
has several implications. The most obvious is that true GxE 
effects may often go undetected. However, low power also 
increases the rate of false discoveries across a field. Given 
the potentially low prior probability of true cGxE hypothe- 
ses, stemming from the difficulty of identifying the correct 
genetic and environmental variables, the false discovery 
rate in cGxE research in psychiatry could be very high; the 
possibility that most or even all positive cGxE findings in 
psychiatry discovered to date represent type I errors can- 
not be discounted. 
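
The dependence of the false discovery rate on power and prior probability can be illustrated with a minimal sketch. It uses the standard expression for the proportion of significant results that are false, alpha(1 - prior) / [alpha(1 - prior) + power x prior], and assumes no publication bias or multiple testing, both of which would raise the rate further. The prior and power values below are hypothetical and are not estimates from this review.

```python
def false_discovery_rate(prior: float, power: float, alpha: float = 0.05) -> float:
    """Expected share of significant findings that are type I errors.

    prior: probability that a tested hypothesis is true
    power: probability of detecting a true effect
    alpha: type I error rate applied to each test
    """
    false_positives = alpha * (1.0 - prior)   # true nulls declared significant
    true_positives = power * prior            # real effects declared significant
    return false_positives / (false_positives + true_positives)

# Hypothetical scenarios: a 5% prior with low versus high power.
for power in (0.10, 0.80):
    fdr = false_discovery_rate(prior=0.05, power=power)
    print(f"prior=0.05, power={power:.2f} -> false discovery rate = {fdr:.2f}")
```

Under these illustrative values, most significant findings are false even when power is high, and nearly all are false when power is low.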

For scientific progress to be made in the cGxE field, it is 
crucial to begin to differentiate the true cGxE effects from 
the false. How can this be accomplished? One step for- 
ward would be to encourage authors to submit and editors 
to accept null reports in order to reduce the publication 
biases present in the field, but incentives to publish posi- 
tive reports are unlikely to change for either authors or ed- 
itors anytime soon. Perhaps a more realistic way to begin 
discerning true results in the cGxE field is to acknowledge 
that false positive results are a natural consequence of the 
incentive structure that exists in modern science, and that 
because of this, authors, consumers, editors, and review- 
ers should recalibrate their views on what constitutes an 
important scientific contribution. Given the likely high 
false positive rate among novel findings (19) and indirect 
replications (36, 37) and the low false positive rate among 
direct replications (36), well-powered studies conducted 
with the express purpose of closely replicating previous 
findings should be viewed as more scientifically impor- 
tant than novel “groundbreaking” cGxE results or indirect 
replications. The practice of according the most prestige 
to novel findings contributes to the ambiguous state of 
cGxE research and potentially to the proliferation of type 
I errors. 

This review should not be taken as a call for skepticism 
about the GxE field in psychiatry. We believe that GxEs 
are likely to be common and that they may well prove to 
be important or even central for understanding the etiol- 
ogy of psychiatric disorders. At issue is how to separate 
the wheat from the chaff: which GxE findings are replicable 
and illuminating, and which are spurious and lead 
to wasted resources, false hope, and increased skepti- 
cism? Scientists investigating genetic main effects using 
genome-wide association methods have made minimiz- 
ing false discoveries a central creed of their enterprise (10). 
Indeed, the benefits of comprehensive SNP coverage and 
a conservative alpha have yielded hundreds of robust and 
replicable genetic associations. Such genome-wide meth- 
ods have been proposed for the study of GxEs (42) and will 
undoubtedly prove informative, but this is not the only 
solution. Rather, true progress in understanding GxEs in 
psychiatry requires investigators, reviewers, and editors to 
agree on standards that will increase certainty in reported 
results. By doing so, the second decade of GxE research in 
psychiatry can live up to the promises made by the first. 